Import package

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

The Very First Attempt

fruitfly <- read.csv('fruitfly.csv')

plot(fruitfly$sleep, fruitfly$lifespan)

Introduction to ggplot2

ggplot() allows us to build up a plot layer by layer.

Every graph puts three components together:

  1. data
  2. geometric objects
  3. coordinate system

You begin a plot with the function ggplot(), which creates a coordinate system that we can add layers to. The first argument of ggplot() is the data to use in the graph. You then complete the graph by adding one or more layers to ggplot(data).

geom_XXX() adds a layer of geometric objects to your plot. For example, geom_point() creates a scatterplot; there are many different geom functions for different types of graphs.

Each geom_XXX() takes a mapping argument, which is always paired with aes() and maps variables to visual properties. The first step is to map variables to the coordinate system, that is, to the x and y axes.
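As a minimal sketch of this template, the code below builds a scatterplot from the mpg data set that ships with ggplot2. The object name p_template is made up for illustration; displ and hwy are columns of mpg.

```r
library(ggplot2)

# 1. data: the first argument of ggplot()
# 2. geometric object: geom_point() draws one point per row
# 3. coordinate system: aes() maps displ to the x axis and hwy to the y axis
p_template <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

p_template
```

Swapping geom_point() for another geom_XXX() changes the type of graph while the data and the mapping stay the same.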

Import Data Set

str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 

First, we want to make a random sample from the diamonds data set:

set.seed(922)

diamonds1 <- diamonds[sample(1:53940, 1000, replace = FALSE), ]

glimpse(diamonds1)
## Rows: 1,000
## Columns: 10
## $ carat   <dbl> 2.01, 0.77, 0.31, 1.31, 1.51, 0.75, 0.56, 0.30, 1.01, 0.31, 0.…
## $ cut     <ord> Very Good, Ideal, Premium, Premium, Ideal, Very Good, Ideal, I…
## $ color   <ord> I, D, H, I, J, H, J, I, J, F, D, G, I, E, I, F, H, F, J, D, E,…
## $ clarity <ord> SI1, SI1, IF, VS2, VS1, SI2, VS2, VVS2, VS2, VS2, SI2, VS1, SI…
## $ depth   <dbl> 61.8, 60.8, 60.8, 61.9, 61.9, 63.0, 62.0, 62.0, 63.3, 61.3, 63…
## $ table   <dbl> 62, 57, 59, 59, 58, 58, 56, 56, 54, 55, 55, 59, 57, 57, 56, 63…
## $ price   <int> 13691, 3251, 739, 6323, 8170, 2180, 1224, 515, 3620, 591, 574,…
## $ x       <dbl> 7.98, 5.94, 4.36, 7.02, 7.28, 5.76, 5.27, 4.29, 6.42, 4.35, 4.…
## $ y       <dbl> 8.07, 5.90, 4.39, 6.98, 7.33, 5.79, 5.31, 4.32, 6.34, 4.39, 4.…
## $ z       <dbl> 4.96, 3.60, 2.66, 4.33, 4.52, 3.64, 3.28, 2.67, 4.04, 2.68, 2.…
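As an alternative sketch, dplyr provides slice_sample() for the same task, which avoids hard-coding the number of rows. The name diamonds_sample is made up for illustration, and the rows drawn are not guaranteed to match diamonds1 even with the same seed.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

set.seed(922)

# Draw 1000 rows at random; sampling without replacement is the default.
diamonds_sample <- slice_sample(diamonds, n = 1000)

glimpse(diamonds_sample)
```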

Create the Plot

ggplot(diamonds1) + 
  geom_point(aes(x = carat, y = price))

ggplot(diamonds1) + 
  geom_point(aes(x = carat, y = price)) + 
  labs(x = "Carat", y = "Price", title = "Figure 1:diamonds price by carat") + 
  theme_bw()

If we put the aes() mappings in the first line, inside ggplot(), they are called global variables, and they affect every layer of the plot. In contrast, local variables are the mappings we put inside a geom_XXX() function; they affect only that geom_XXX() layer.

In our previous example, there is no difference between making aes(x = carat, y = price) global variables and making them local variables, because the plot has only one layer.

ggplot(diamonds1, aes(x = carat, y = price)) + 
  geom_point() + 
  labs(x = "Carat", y = "Price", title = "Figure 1:diamonds price by carat") + 
  theme_bw()
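The distinction starts to matter once a plot has more than one layer. The sketch below uses the mpg data set from ggplot2 rather than diamonds1, and the names p_local and p_global are made up for illustration. It maps color to drv either locally or globally and adds a regression line.

```r
library(ggplot2)

# Local: color belongs to the point layer only,
# so geom_smooth() fits a single line through all points.
p_local <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x)

# Global: color is inherited by every layer,
# so geom_smooth() fits one line per level of drv.
p_global <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x)

p_local
p_global
```

Both plots color the points the same way; only the number of fitted lines differs.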

Save Plots

We can save a plot to an object.

p1 <- ggplot(diamonds1, aes(x = carat, y = price)) + 
  geom_point() + 
  labs(x = "Carat", y = "Price", title = "Figure 1:diamonds price by carat") + 
  theme_bw()

p1
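A stored plot can also be extended with further layers or written to disk with ggsave(). The sketch below assumes a throwaway plot p_demo built from mpg and a temporary output path; both names are made up for illustration, and the width, height, and dpi values are arbitrary.

```r
library(ggplot2)

p_demo <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  theme_bw()

# Add another layer to the stored plot without retyping it.
p_demo_fit <- p_demo +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x)

# Write the plot to a PNG file; width and height are in inches by default.
out_file <- file.path(tempdir(), "p_demo.png")
ggsave(out_file, plot = p_demo_fit, width = 6, height = 4, dpi = 300)
```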

Mapping Aesthetic Arguments

  • Mapping to size
ggplot(diamonds1, aes(x = carat, y = price)) + 
  geom_point(aes(size = cut)) + 
  labs(x = "Carat", y = "Price", title = "Figure 1:diamonds price by carat") + 
  theme_bw()

  • Mapping to shape
ggplot(diamonds1, aes(x = carat, y = price)) + 
  geom_point(aes(shape = cut)) + 
  labs(x = "Carat", y = "Price", title = "Figure 1:diamonds price by carat") + 
  theme_bw()
## Warning: Using shapes for an ordinal variable is not advised

  • Mapping to color
ggplot(diamonds1, aes(x = carat, y = price)) + 
  geom_point(aes(color = cut)) + 
  labs(x = "Carat", y = "Price", title = "Figure 1:diamonds price by carat") + 
  theme_bw()

We can use multiple mappings:

ggplot(diamonds1, aes(x = carat, y = price)) + 
  geom_point(aes(shape = cut, color = cut)) + 
  labs(x = "Carat", y = "Price", title = "Figure 1:diamonds price by carat") + 
  theme_bw()
## Warning: Using shapes for an ordinal variable is not advised

Note: map size to numeric variables and shape to categorical variables.

  • Mapping to alpha (transparency)

ggplot(diamonds1)+
  geom_point(aes(x=carat, y= price), alpha=0.1)

ggplot(diamonds1)+
  geom_point(aes(x=carat, y= price, alpha=cut))

Note: color and alpha aesthetics can be mapped to either categorical variables or numeric variables.
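The two alpha examples above differ in where alpha is placed: outside aes() it is set to a constant, inside aes() it is mapped to a variable. The sketch below shows the same contrast for color, using the mpg data set; the plot names are made up for illustration.

```r
library(ggplot2)

# Setting: a constant outside aes() applies to every point and adds no legend.
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue", alpha = 0.3)

# Mapping: a variable inside aes() gets a scale and a legend.
p_map <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv))

# Common slip: a constant inside aes() is treated as a one-level variable,
# so the points get the default palette color instead of blue.
p_slip <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = "blue"))
```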

More Geom Functions

  • geom_line()
  • geom_smooth()
  • geom_histogram()
  • and so on
ggplot(diamonds1, aes(x = carat, y = price)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE, formula = "y~x")

In the example above, aes(x = carat, y = price) is global, so we do not need to repeat it every time a layer uses those variables. However, if we make it local, we need to repeat it in every layer:

ggplot(diamonds1) + 
  geom_point(aes(x = carat, y = price)) + 
  geom_smooth(aes(x = carat, y = price), method = "lm", se = FALSE, formula = "y~x")

Example: mtcars is a built-in data set. Here is a preview of it:

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

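Make a scatter plot of mpg against wt, mapping color to cyl and shape to am. Because cyl and am are stored as numbers, we wrap them in factor() so that they are treated as categorical variables.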

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl), shape = factor(am))) + 
  geom_point()
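Calling factor() inside aes() leaves legend titles such as factor(cyl). One possible way around this, sketched below, is to convert the columns first and give the levels readable labels. The name mtcars_labeled is made up for illustration; according to the mtcars help page, am is 0 for automatic and 1 for manual.

```r
library(ggplot2)

# Work on a copy so the built-in mtcars stays untouched.
mtcars_labeled <- mtcars
mtcars_labeled$cyl <- factor(mtcars_labeled$cyl)
mtcars_labeled$am <- factor(mtcars_labeled$am,
                            levels = c(0, 1),
                            labels = c("automatic", "manual"))

p_cars <- ggplot(mtcars_labeled, aes(x = wt, y = mpg, color = cyl, shape = am)) +
  geom_point(size = 3) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       color = "Cylinders", shape = "Transmission")

p_cars
```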

Color

Three Ways to Change Colors

  • The first way: enter the color name (google ggplot color)
ggplot(diamonds1, aes(x = carat, y = price)) +
  geom_point(alpha = 0.2, size = 3, shape = 6, color = "red") +    # or use number
  labs(x = "Carat", y = "Price", title = "Figure 1:diamonds price by carat") +
  theme_bw()

  • The second way: mix red, green, and blue with rgb()
ggplot(diamonds1, aes(x = carat, y = price)) +
  geom_point(alpha = 0.2, size = 3, shape = 6, color = rgb(0.5, 0.7, 0.2)) +
  labs(x = "Carat", y = "Price", title = "Figure 1:diamonds price by carat") +
  theme_bw()

  • The third way: Hex color codes (Google hex color codes)
ggplot(diamonds1, aes(x = carat, y = price)) +
  geom_point(alpha = 0.2, size = 3, shape = 6, color = "#012169") +
  labs(x = "Carat", y = "Price", title = "Figure 1:diamonds price by carat") +
  theme_minimal()

Color of Continuous Variables

ggplot(diamonds1, aes(x = carat, y = price)) + 
  geom_point(aes(color = carat))

If we want to change the color scale, we use the scale_color_gradient() function.

ggplot(diamonds1, aes(x = carat, y = price)) + 
  geom_point(aes(color = carat)) + 
  scale_color_gradient(name = "Carat", low = "darkblue", high = "orange")

Color of Categorical Variables

ggplot(diamonds1, aes(x = carat, y = price)) + 
  geom_point(aes(color = clarity)) + 
  scale_color_discrete(name = "Clarity")

If we want to change the colors manually, we first need to know how many levels the variable has.

levels(diamonds$clarity)
## [1] "I1"   "SI2"  "SI1"  "VS2"  "VS1"  "VVS2" "VVS1" "IF"

Now, we use scale_color_manual() to change the color scale, supplying one color per level.

ggplot(diamonds1, aes(x=carat, y=price))+
  geom_point(aes(color=clarity))+
  scale_color_manual(name="Clarity Title", values=c("red", "darkblue","darkgreen", "grey", "grey3", "black","darkred","darkorange"))
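An unnamed vector of colors is matched to the levels by position, which is easy to get wrong. As a sketch, the colors can instead be passed as a named vector so that each level is tied to its color explicitly. The names clarity_colors and diamonds_demo are made up for illustration, and the sample is re-drawn here only to keep the block self-contained.

```r
library(ggplot2)

set.seed(922)
diamonds_demo <- diamonds[sample(nrow(diamonds), 1000), ]

# One color per clarity level, keyed by the level name.
clarity_colors <- c(
  I1   = "red",
  SI2  = "darkblue",
  SI1  = "darkgreen",
  VS2  = "grey",
  VS1  = "grey3",
  VVS2 = "black",
  VVS1 = "darkred",
  IF   = "darkorange"
)

p_named <- ggplot(diamonds_demo, aes(x = carat, y = price)) +
  geom_point(aes(color = clarity)) +
  scale_color_manual(name = "Clarity", values = clarity_colors)

p_named
```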

Size

ggplot(diamonds1, aes(x = carat, y = price)) +
  geom_point(aes(size = carat)) +
  scale_size(name = "Carat Size", range = c(3, 8)) +    ## edit legend about size
  labs(x = "carat", y = "price", title = "diamonds price by carat")

size is for quantitative variables (numeric vectors).

Shape

is.factor(diamonds1$cut)
## [1] TRUE

Since cut is a discrete variable, we can use it for shape.

ggplot(diamonds1, aes(x = carat, y = price)) +
  geom_point(aes(shape = cut), color = "blue") +
  scale_shape(name = "Cut Types") +  # edit legend about shape
  labs(x = "carat", y = "price", title = "diamonds price by carat")

shape is for categorical variables (factors).

Alpha

ggplot(diamonds1, aes(x = carat, y = price)) +
  geom_point(aes(alpha = price), color = "blue", position = 'jitter') +
  scale_alpha(name = "Price")  # edit legend of opacity

alpha works best with continuous variables (numeric vectors), although it can also be mapped to categorical ones.

Note: jitter randomly shifts each data point slightly so that overlapping points become visible.

ggplot(mtcars, aes(x = wt, y = am)) + 
  geom_point(alpha = 0.1)

Some points are darker than others, which means that several data points overlap. Try using jitter to separate the overlapping points.

ggplot(mtcars, aes(x = wt, y = am)) + 
  geom_point(position = "jitter")

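Note: The final plot is misleading. By default, jitter also moves the points vertically, so am no longer appears to take only the values 0 and 1.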

Or we can directly use the geom_jitter() function. In the geom_jitter() function, we can specify the settings of jitter.

ggplot(mtcars, aes(x = wt, y = am)) + 
  geom_jitter(width = 0.5, height = 0.01)

Another way is to use position = position_jitter(). Using this argument in the geom_point() function creates the same kind of plot as above, because we can also specify the jitter settings inside position_jitter(). The exact positions differ from run to run because jitter is random.

ggplot(mtcars, aes(x = wt, y = am)) + 
  geom_point(position = position_jitter(width = 0.5, height = 0.01))
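Two further settings can make a jittered plot easier to trust. The sketch below sets height = 0 so that am stays exactly at 0 or 1, and passes a seed so that the plot looks the same each time it is drawn. The width, alpha, and seed values are arbitrary choices, and p_jit is a made-up name.

```r
library(ggplot2)

p_jit <- ggplot(mtcars, aes(x = wt, y = am)) +
  geom_point(
    alpha = 0.5,
    # Jitter horizontally only, and fix the random seed for reproducibility.
    position = position_jitter(width = 0.1, height = 0, seed = 922)
  )

p_jit
```

Note that horizontal jitter still moves wt slightly, so the width should stay small relative to the spread of the data.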

Facet

The facet functions split one plot into multiple subplots, one for each level (or combination of levels) of the faceting variables.

ggplot(diamonds1, aes(x = carat, y = price)) +
  geom_point(aes(color = clarity)) +
  facet_grid(. ~ cut)

ggplot(diamonds1, aes(x = carat, y = price)) +
  geom_point(aes(color = clarity)) +
  facet_grid(clarity ~ cut)

ggplot(diamonds1, aes(x = carat, y = price)) +
  geom_point(aes(color = clarity)) +
  facet_wrap(. ~ cut)

ggplot(diamonds1, aes(x = carat, y = price)) +
  geom_point(aes(color = clarity)) +
  facet_wrap(clarity ~ cut)
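facet_grid() arranges the panels in a strict rows ~ columns grid, while facet_wrap() wraps a sequence of panels into as many rows and columns as needed. As a sketch, facet_wrap() also lets us choose the number of columns and give each panel its own y axis. The names diamonds_demo and p_wrap are made up for illustration, and ncol = 2 is an arbitrary choice.

```r
library(ggplot2)

set.seed(922)
diamonds_demo <- diamonds[sample(nrow(diamonds), 1000), ]

p_wrap <- ggplot(diamonds_demo, aes(x = carat, y = price)) +
  geom_point(alpha = 0.4) +
  # One panel per cut, laid out in two columns, each with its own y scale.
  facet_wrap(~ cut, ncol = 2, scales = "free_y")

p_wrap
```

Free scales make each panel easier to read but harder to compare across panels, so they are a trade-off rather than a default.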

Add Labels

ggplot(mpg, aes(cty, hwy)) + 
  geom_point(alpha = 0.7, size = 7, position = "jitter", aes(color = cty)) + 
  geom_text(aes(label = factor(cyl)), 
            color = "white", size = 3, 
            vjust = 1.5, position = position_dodge(0.3), 
            check_overlap = TRUE)

Other Types of Plots

Bar Plot

ggplot(diamonds1, aes(x = cut)) + 
  geom_bar()

ggplot(diamonds1, aes(x = cut)) + 
  geom_bar(aes(fill = clarity))

ggplot(diamonds1, aes(x = cut)) + 
  geom_bar(aes(fill = clarity), position = "dodge")

If we use stat_count() instead of geom_bar(), we get exactly the same plot. In ggplot2, each geom has a default statistical transformation (stat); for geom_bar() it is stat_count().

ggplot(diamonds1, aes(x = cut)) + 
  stat_count(aes(fill = clarity), position = "dodge")

Boxplot, Histogram, Density, and Line Graphs

  1. Boxplot is for one quantitative vs. one categorical variable
ggplot(diamonds1, aes(x = cut, y = price)) +
  geom_boxplot(aes(col = clarity))

  2. Histogram is for one quantitative variable
ggplot(diamonds1, aes(price)) +
  geom_histogram(bins = 10)

  3. Density is also for one quantitative variable
ggplot(diamonds1, aes(price)) +
  geom_density(aes(color = clarity))

  4. We can add lines to time series data
?economics 
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

Statistical transformations

ggplot(diamonds1, aes(x=cut))+
  geom_bar() 

Each geometric object has a default statistical transformation (stat). For geom_bar() it is stat_count(), which counts the rows in each category.

The previous code is the same as the following:

ggplot(diamonds1, aes(x=cut))+
  stat_count()
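To inspect what stat_count() actually computes, one option is layer_data(), which returns the data a layer draws after the stat has run. The sketch below uses the full diamonds data; the names p_bar, bar_data, and p_prop are made up for illustration. It also shows after_stat(), which lets a geom use a computed variable such as prop.

```r
library(ggplot2)

p_bar <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()

# The computed variables include count and prop for each bar.
bar_data <- layer_data(p_bar)
bar_data[, c("x", "count", "prop")]

# The counts agree with a plain frequency table.
table(diamonds$cut)

# Plot proportions instead of counts; group = 1 makes prop
# relative to the whole data set rather than to each bar.
p_prop <- ggplot(diamonds, aes(x = cut)) +
  geom_bar(aes(y = after_stat(prop), group = 1))

p_prop
```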

prop.table(table(diamonds$cut))
## 
##       Fair       Good  Very Good    Premium      Ideal 
## 0.02984798 0.09095291 0.22398962 0.25567297 0.39953652
demo <- tribble(
  ~cut,         ~prop,
  "Fair",       0.0298, 
  "Good",       0.0909,
  "Very Good",  0.2239,
  "Premium",    0.2556,
  "Ideal",      0.3995
)

If we run

ggplot(demo, aes(x = cut, y = prop)) + 
  geom_bar()

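in R, we will get an error because geom_bar() uses stat_count() by default, which counts the rows for each x value itself and does not accept a y variable. Here, we need to add a stat = "identity" argument to geom_bar() so that the bar heights are taken directly from prop: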

ggplot(demo, aes(x = cut, y = prop)) + 
  geom_bar(stat = "identity")

The previous code is equivalent to the following:

ggplot(demo, aes(x = cut, y = prop)) + 
  geom_col()

geom_col() expects both an x and a y variable and uses the y values as the bar heights, so the previous code runs without an error and without any additional argument.
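Rather than typing the proportions by hand, they can be computed from the data. The sketch below assumes dplyr is loaded; the names demo_computed and p_col are made up for illustration.

```r
library(ggplot2)
library(dplyr)

# Count the rows per cut, then turn the counts into proportions.
demo_computed <- diamonds %>%
  count(cut) %>%
  mutate(prop = n / sum(n))

demo_computed

p_col <- ggplot(demo_computed, aes(x = cut, y = prop)) +
  geom_col()

p_col
```

Because cut stays an ordered factor here, the bars follow the order Fair to Ideal, whereas a character column such as the one in demo is plotted in alphabetical order.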

We can use coord_flip() to swap the x and y axes.

ggplot(demo, aes(x = cut, y = prop)) + 
  geom_col() + 
  coord_flip()